05. Hill Climbing Pseudocode


## What's the difference between G and J?

You might be wondering: what's the difference between the return that the agent collects in a single episode (G, from the pseudocode above) and the expected return J?

Well, in reinforcement learning, the goal of the agent is to find the values of the policy network weights \theta that maximize the expected return, which we have denoted by J.
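To make the distinction concrete, one common way of writing the two quantities looks like the sketch below. The discount factor \gamma, horizon H, and trajectory notation \tau are assumptions on top of the text above, not something fixed by the pseudocode itself:

```latex
G = \sum_{t=0}^{H} \gamma^{t} r_{t+1},
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right]
```

In words: G is the (discounted) return from one sampled episode, while J(\theta) averages that return over all episodes the policy with weights \theta could generate.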

In the hill climbing algorithm, the values of \theta are evaluated according to how much return G they collect in a single episode. To see why this might seem a little strange, note that due to randomness in the environment (and in the policy, if it is stochastic), collecting a second episode with the same values of \theta will likely produce a different value for the return G. Because of this, the (sampled) return G is not a perfect estimate of the expected return J, but it often turns out to be good enough in practice.
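Below is a minimal sketch of how such a hill climbing loop might look in Python for a discrete-action task. It assumes a Gymnasium-style environment API, a simple deterministic linear policy, and no discounting (\gamma = 1); none of these choices are dictated by the pseudocode above. The key point is that each candidate \theta is scored with the return G from just one episode.

```python
import numpy as np
import gymnasium as gym  # assumption: a Gymnasium-style environment API


def run_episode(env, theta, max_steps=1000):
    """Collect one episode with a linear policy and return the sampled return G."""
    state, _ = env.reset()
    G = 0.0
    for _ in range(max_steps):
        # Deterministic linear policy: pick the action with the largest score.
        action = int(np.argmax(state @ theta))
        state, reward, terminated, truncated, _ = env.step(action)
        G += reward
        if terminated or truncated:
            break
    return G


def hill_climbing(env, n_iterations=1000, noise_scale=0.1):
    """Hill climbing: evaluate each candidate theta by the return G of a single episode."""
    obs_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n
    best_theta = 1e-4 * np.random.randn(obs_dim, n_actions)
    best_G = run_episode(env, best_theta)
    for _ in range(n_iterations):
        # Perturb the current best weights with Gaussian noise.
        candidate = best_theta + noise_scale * np.random.randn(obs_dim, n_actions)
        G = run_episode(env, candidate)  # a single, noisy sample of the return
        # Keep the candidate only if its sampled return beats the best seen so far.
        if G > best_G:
            best_theta, best_G = candidate, G
    return best_theta


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    theta = hill_climbing(env)
    print("Single-episode return with the best weights found:", run_episode(env, theta))
```

Because each candidate is judged on one noisy sample of G rather than on J itself, a lucky episode can make mediocre weights look good; evaluating each candidate over several episodes and averaging the returns is a common way to reduce that noise, at the cost of more environment interaction.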